@Fiona-Waters
Contributor

@Fiona-Waters Fiona-Waters commented Oct 7, 2025

What this PR does / why we need it:

This PR introduces a unified ContainerBackend that automatically detects and uses either Docker or Podman for local training execution. This replaces the previous separate LocalDockerBackend and LocalPodmanBackend implementations with a single, cleaner abstraction. You can see the Docker and Podman implementations in separate commits.

This implementation tries Docker first, then falls back to Podman if Docker is unavailable. This can be overridden via ContainerBackendConfig.runtime to force a specific runtime ("docker" or "podman"). An error is raised if neither runtime is available.
Unit tests for the backend implementation have also been added. Examples for using Docker and Podman will be added to the Trainer repo later.
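For illustration, the detection order described above could be sketched roughly as follows. This is not the code from the PR: `select_runtime`, the `is_available` hook, and the check for a binary on `PATH` are assumptions made for the example, and the real backend may probe the runtime differently (for example by contacting its socket).

```python
import shutil
from typing import Callable, Optional

# Order mirrors the PR description: Docker first, then Podman.
SUPPORTED_RUNTIMES = ("docker", "podman")


def _binary_on_path(name: str) -> bool:
    # Assumption for this sketch: a runtime counts as available
    # when its CLI binary can be found on PATH.
    return shutil.which(name) is not None


def select_runtime(
    requested: Optional[str] = None,
    is_available: Callable[[str], bool] = _binary_on_path,
) -> str:
    """Return the runtime to use, or raise if none can be used."""
    if requested is not None:
        # An explicit choice skips the fallback logic entirely.
        if requested not in SUPPORTED_RUNTIMES:
            raise ValueError(f"Unsupported runtime: {requested!r}")
        if not is_available(requested):
            raise RuntimeError(f"Requested runtime {requested!r} is not available")
        return requested

    for name in SUPPORTED_RUNTIMES:
        if is_available(name):
            return name
    raise RuntimeError("Neither Docker nor Podman is available")
```

Passing `is_available` as a parameter keeps the sketch testable on machines that have neither runtime installed.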

When testing manually on macOS, I had to specify container_host explicitly:

  • Docker via Colima: container_host=f"unix://{os.path.expanduser('~')}/.colima/default/docker.sock"
  • Podman Desktop: container_host=f"unix://{os.path.expanduser('~')}/.local/share/containers/podman/machine/podman.sock"
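As a small sketch, the two socket URLs above could be built with a helper like the one below. The helper name `container_host_for` and the `SOCKET_SUFFIXES` table are hypothetical; the paths are copied from this comment and reflect one Colima and Podman Desktop setup on macOS, so they may differ on other machines.

```python
import os

# Socket locations relative to the home directory, taken from the comment
# above. They are setup-specific and not guaranteed defaults.
SOCKET_SUFFIXES = {
    "docker": ".colima/default/docker.sock",
    "podman": ".local/share/containers/podman/machine/podman.sock",
}


def container_host_for(runtime: str) -> str:
    """Build a unix:// URL for the given runtime's socket."""
    home = os.path.expanduser("~")
    return f"unix://{home}/{SOCKET_SUFFIXES[runtime]}"
```

The returned string is the kind of value that was passed as container_host during manual testing.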

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #114 and #108

Checklist:
I need to look at adding docs. A README has been included.

  • Docs included if any changes are user facing

Member

@andreyvelich andreyvelich left a comment

Thank you for this @Fiona-Waters!
As we discussed here: #111 (comment), can we consolidate Podman and Docker under a single container backend?
Given that those backends should have similar APIs, I think it would be better to consolidate them, similar to KFP: https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/execute-kfp-pipelines-locally/#runner-dockerrunner

@Fiona-Waters
Contributor Author

Thanks @andreyvelich I will look at updating the implementation.

@Fiona-Waters
Contributor Author

@andreyvelich @astefanutti regarding comments on this PR and #111 this is what I propose:

We have 3 backends:

  • Kubernetes
  • Subprocess
  • Local Container

For the Local Container backend we automatically try Docker first, then Podman, and then fall back to Subprocess if neither runtime is available. We use the adapter pattern: a container client adapter provides a unified interface, and the Docker- and Podman-specific calls are implemented in separate adapter classes.
There could also be an option where users can force a specific runtime, for example:
LocalContainerBackendConfig(runtime="docker")
This implementation will make it easy to add support for other container runtimes in the future, if that's a possibility.
Please let me know what you think. Thanks!
cc @briangallagher
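A minimal sketch of the adapter idea proposed in this comment, for readers following along. All class and method names here (`ContainerClientAdapter`, `DockerAdapter`, `PodmanAdapter`, `pick_adapter`, `is_available`, `run`) are illustrative assumptions, not the interface that was merged, and the adapter bodies are placeholders instead of real Docker or Podman SDK calls.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class ContainerClientAdapter(ABC):
    """Hypothetical unified interface that the backend codes against."""

    name: str

    @abstractmethod
    def is_available(self) -> bool:
        """Report whether this runtime can be reached."""

    @abstractmethod
    def run(self, image: str, command: List[str]) -> str:
        """Start a container and return an identifier for it."""


class DockerAdapter(ContainerClientAdapter):
    name = "docker"

    def __init__(self, available: bool = True):
        # A real adapter would wrap a Docker SDK client here.
        self._available = available

    def is_available(self) -> bool:
        return self._available

    def run(self, image: str, command: List[str]) -> str:
        # Placeholder standing in for the Docker-specific call.
        return f"docker:{image}:{' '.join(command)}"


class PodmanAdapter(ContainerClientAdapter):
    name = "podman"

    def __init__(self, available: bool = True):
        # A real adapter would wrap a Podman client here.
        self._available = available

    def is_available(self) -> bool:
        return self._available

    def run(self, image: str, command: List[str]) -> str:
        # Placeholder standing in for the Podman-specific call.
        return f"podman:{image}:{' '.join(command)}"


def pick_adapter(adapters: Sequence[ContainerClientAdapter]) -> ContainerClientAdapter:
    """Return the first available adapter, in the order given."""
    for adapter in adapters:
        if adapter.is_available():
            return adapter
    raise RuntimeError("No container runtime is available")
```

Because the backend would depend only on the abstract interface, supporting another runtime would mean adding one more adapter class and listing it in the order passed to `pick_adapter`.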

@andreyvelich
Member

andreyvelich commented Oct 8, 2025

Sure, that looks great @Fiona-Waters!

fallback to Subprocess if neither runtime is available

Why do we need to fall back to subprocess?
I would imagine we support 3 backends, and users decide which one they want to use:

KubernetesBackend()
ContainerBackend()
LocalProcessBackend()

In the ContainerBackend users can select:

ContainerBackend(
    ContainerBackendConfig(container_runtime="docker")
)
or
ContainerBackend(
    ContainerBackendConfig(container_runtime="podman")
)

@astefanutti
Contributor

@Fiona-Waters that sounds good to me. I agree the fallback logic may really apply only to choosing the default container runtime.

Other than that, I'd be inclined to drop the "Local" prefix entirely. Even Kubernetes could run locally with KinD, and I doubt the SDK will ever run remote processes.

@Fiona-Waters
Contributor Author

Understood. Let me see what I can do. Thank you for the swift reply!

@Fiona-Waters
Contributor Author

Ok cool. Let me see what I can do. Thank you!

@Fiona-Waters Fiona-Waters changed the title feat: Add Podman backend and sync Docker backend implementation [WIP] feat: Add Podman backend and sync Docker backend implementation Oct 8, 2025
@Fiona-Waters Fiona-Waters force-pushed the podman-backend branch 4 times, most recently from bdde877 to d1288b2 Compare October 10, 2025 15:55
@Fiona-Waters Fiona-Waters changed the title [WIP] feat: Add Podman backend and sync Docker backend implementation feat: Add Podman backend and sync Docker backend implementation Oct 10, 2025
@Fiona-Waters
Contributor Author

@andreyvelich @astefanutti @briangallagher
I've updated the PR. Please take a look. Thanks

@Fiona-Waters Fiona-Waters changed the title feat: Add Podman backend and sync Docker backend implementation feat: Add ContainerBackend with Docker and Podman Oct 10, 2025
@astefanutti
Contributor

/ok-to-test

Contributor

@astefanutti astefanutti left a comment

@Fiona-Waters thanks for this awesome work!

That looks good to me overall.

/assign @kubeflow/kubeflow-sdk-team @briangallagher

Member

@andreyvelich andreyvelich left a comment

Thank you @Fiona-Waters!
I left my initial comments.

@@ -0,0 +1,25 @@
apiVersion: trainer.kubeflow.org/v1alpha1
Member

Instead of installing the runtimes, can we just read the image version from GitHub dynamically?

Contributor Author

Let me look into that. For offline support should we fall back to providing this?

Member

@Fiona-Waters Do we only require the image?
We can fall back to a constant that we define somewhere.

Contributor Author

Is it okay to hardcode constants that will then need to be updated manually later? Something like this:

DEFAULT_FRAMEWORK_IMAGES = {
    "torch": "pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime",
}

This would then be used to create a default runtime if the user doesn't provide a runtime URL (and we have removed the sample one).
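A rough sketch of how such a constant could serve as the fallback. The function `resolve_image` and its parameters are hypothetical, and the example simplifies the proposal by taking an optional image string directly instead of a runtime URL; only the `DEFAULT_FRAMEWORK_IMAGES` entry comes from this comment.

```python
from typing import Optional

# Entry copied from the comment above; it would need manual updates.
DEFAULT_FRAMEWORK_IMAGES = {
    "torch": "pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime",
}


def resolve_image(framework: str, user_image: Optional[str] = None) -> str:
    """Prefer the user's image, otherwise use the hardcoded default."""
    if user_image:
        return user_image
    try:
        return DEFAULT_FRAMEWORK_IMAGES[framework]
    except KeyError:
        raise ValueError(f"No default image defined for framework {framework!r}") from None
```

With this shape, offline use keeps working because no network lookup is needed to find the default image.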

Contributor Author

I've pushed changes to use a default image if a user doesn't provide a URL and have removed the yaml. Please review. Thanks

briangallagher and others added 5 commits October 31, 2025 23:38
@Fiona-Waters
Contributor Author

@andreyvelich I have addressed all comments, and rebased. Please take a look when you can. Thanks.

@Fiona-Waters Fiona-Waters force-pushed the podman-backend branch 2 times, most recently from 6edf0e3 to 9bf4cb7 Compare November 1, 2025 00:38
Member

@andreyvelich andreyvelich left a comment

Thanks for the updates @Fiona-Waters!
Just a few small nits, overall looks great!
We should be ready to move this forward.

@Fiona-Waters Fiona-Waters force-pushed the podman-backend branch 2 times, most recently from 5df1c72 to 51cf1ed Compare November 1, 2025 10:11
@coveralls

Pull Request Test Coverage Report for Build 18998808027

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 79.621%

Totals Coverage Status
Change from base Build 18993446847: 0.0%
Covered Lines: 168
Relevant Lines: 211

💛 - Coveralls

@Fiona-Waters
Contributor Author

@andreyvelich I have addressed your most recent comments. We should be good to go now. Let me know if you want me to squash the commits. Thank you!

@andreyvelich
Member

Thanks for this great contribution @Fiona-Waters!
/lgtm
/assign @astefanutti @Electronic-Waste @kramaranya

@astefanutti
Contributor

Thanks @Fiona-Waters for this awesome work!

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 57d2052 into kubeflow:main Nov 3, 2025
13 checks passed
@Fiona-Waters
Contributor Author

Thanks @andreyvelich @astefanutti, delighted to have made this contribution. Thank you, and to all reviewers for their help.
The PR for website guides and the PR for local mode notebook example can be reviewed/merged now. Thanks 🥳
